Predicting the Existence of Exoplanets


Sean Hughes


[exoplanet animation: stellar systems in blue, confirmed exoplanetary systems in red]


Introduction


Exoplanet detection is a critical aspect of modern astronomy and astrophysics. Currently, there are around 5,500 confirmed exoplanets and around 7,000 candidate exoplanets. The main method used to discover exoplanets is the transit of a planet across its star; however, this occurrence is extremely unlikely to be observable, as it requires the orbital plane of the planetary system to be nearly edge-on as seen from the solar system. Furthermore, the transit itself takes up only a small fraction of the planet's orbital period. Consequently, even if an eligible planetary system is being observed, it is unlikely a transit will be detected, let alone the multiple transits at regular intervals required to confirm an exoplanet's existence. Although transits account for the large majority of exoplanet discoveries, these limitations mean that nearly all exoplanets in surrounding planetary systems go undiscovered.
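To see just how unlikely an edge-on alignment is, the geometric transit probability for a circular orbit is roughly R_star / a. Here is a back-of-envelope sketch using standard physical constants (not values from this project's data):

```python
# Back-of-envelope estimate of the geometric transit probability,
# which for a circular orbit is roughly R_star / a.
R_SUN_M = 6.957e8   # solar radius in meters
AU_M = 1.496e11     # one astronomical unit in meters

def transit_probability(r_star_m=R_SUN_M, semi_major_axis_m=AU_M):
    """Probability that a randomly oriented orbit transits as seen from afar."""
    return r_star_m / semi_major_axis_m

p = transit_probability()
print(f"Sun/Earth analog: {p:.3%}")  # roughly 0.5%
```

So even for a perfect Sun/Earth analog, only about one in two hundred random orientations would ever transit.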

Understanding how many exoplanets should exist in a given system is extremely important to astronomical research: it gives researchers a better understanding of which factors play into the formation of exoplanets, and deepens our understanding of the universe and the potential for extraterrestrial life.

This project's goal is to create a machine learning model that predicts the existence, and number, of exoplanets in a given stellar system. We do this under the assumption that similar stellar systems form similar numbers of planets. For example, if a stellar system is older, it could be more likely to have exoplanets. We will test this hypothesis, along with many other variables, in the steps outlined below.

Data Collection and Cleaning


We will be using Python, along with Jupyter Notebooks, to develop and visualize our code. We will begin by collecting and cleaning the data for this project. First, let's import the necessary libraries: numpy, pandas, matplotlib, seaborn, scipy, and scikit-learn.

In [12]:
#imports
#standard libraries for data science
import numpy as np 
import pandas as pd

#python visualization/plotting libraries
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns

#regression/machine learning libraries
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

Downloading/Importing Data


For this project, the data sources we are using are:

NASA's Exoplanet Archive - for exoplanetary data
Henry Draper (HD) Catalogue - for stellar data

After downloading these datasets, we will begin to process them. We will convert the Exoplanet Archive dataset to a pandas DataFrame, remove the redundant "HD" prefix from its 'hd_name' column, and generally clean it up, which will allow us to cross-reference the data against the Henry Draper data.

In [ ]:
#data processing
#csv to pandas
exoDf = pd.read_csv('ExoplanetData.csv')

#remove the redundant "HD" prefix (and any trailing component letter, e.g. "A") from hd_name
exoDf['hd_name'] = exoDf['hd_name'].str.extract(r'(\d+)', expand=False)
exoDf['hd_name'] = pd.to_numeric(exoDf['hd_name'], errors='coerce').fillna(exoDf['hd_name'])
exoDf['hd_name'] = exoDf['hd_name'].astype('Int64')
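To sanity-check the extraction, here is a toy example (hypothetical designations, not rows from the actual file) showing how the regex and numeric conversion behave:

```python
import pandas as pd

# Illustrative check (toy values, not the real catalogue): the regex pulls
# just the numeric HD identifier out of mixed designations.
sample = pd.Series(['HD 1461', 'HD 39091 A', None])
extracted = sample.str.extract(r'(\d+)', expand=False)
ids = pd.to_numeric(extracted, errors='coerce').astype('Int64')
print(ids.tolist())  # [1461, 39091, <NA>]
```

The nullable `Int64` dtype keeps missing HD names as `<NA>` instead of forcing the whole column to float.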

Let's do the same with the HD Catalogue data.

In [ ]:
#csv
hdDf = pd.read_csv('HDCatalogue.csv', sep=';')

Currently there are 121 different columns; let's reduce this to a usable subset. We'll keep age, mass, name, and anything else that can be applied for regression, leaving out unneeded data like flags, discovery methods, and so on.

The variables that we have chosen to keep, although all might not be necessary, are:
pl_name - The name of the exoplanet
hostname - The name of the host star
hd_name - The Henry Draper (HD) catalogue ID
sy_snum - The number of stars in the system
sy_pnum - The number of planets in the system
sy_mnum - The number of moons of the planet
st_spectype - The spectral type of the host star
st_teff - The effective temperature of the host star (K)
st_rad - The radius of the host star (solar radii)
st_mass - The mass of the host star (solar masses)
st_logg - The surface gravity of the host star (log10(cm/s^2))
st_age - The age of the host star (billions of years)
st_dens - The density of the host star (g/cm^3)
ra - Right ascension (degrees)
dec - Declination (degrees)
sy_dist - Distance (parsecs)
sy_vmag - The apparent (V) magnitude of the system

Note that we are keeping a lot of data about the star in each system since, as mentioned before, we are going to rely on information about the star to make predictions about the number of exoplanets in that system.

In [7]:
columns_to_keep = ['pl_name', 'hostname', 'hd_name',  'sy_snum', 'sy_pnum', 
                'sy_mnum', 'st_spectype', 'st_teff', 'st_rad', 'st_mass',
                'st_logg', 'st_age', 'st_dens', 'ra', 'dec', 'sy_dist',
                'sy_vmag']  #sy_vmag (apparent V magnitude) is used later with the HD data

exoDf = exoDf[columns_to_keep]

We have only a small amount of missing (NA) exoplanet data, so we will use mean imputation to fill it in.

In [13]:
columns_to_impute = ['st_age', 'st_mass', 'st_dens', 'st_rad', 'st_teff', 'st_logg', 'sy_dist']
for column in columns_to_impute:
    exoDf[column] = exoDf[column].fillna(exoDf[column].mean())
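Equivalently, scikit-learn's SimpleImputer can perform the same mean imputation; here is a minimal sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for exoDf: mean imputation fills each NaN with its
# column mean, matching the fillna(mean) loop above.
toy = pd.DataFrame({'st_mass': [1.0, None, 3.0], 'st_age': [4.5, 6.5, None]})
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled['st_mass'].tolist())  # [1.0, 2.0, 3.0]
```

The imputer version has the advantage that the learned means can be reapplied to new data via `imputer.transform`.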

Data Visualization/Exploration


Now that we have our data cleaned and stored in our two pandas DataFrames, exoDf and hdDf, let's start with some exploratory data visualization to better understand the data.
We will begin by plotting all of the stars in the Henry Draper Catalogue, along with all of the stars with confirmed exoplanets. This will show whether there is any selection bias in where exoplanet discoveries have been made, and in general makes for a neat visualization. We can do this using the "right ascension" and "declination" values of our systems, which are the astrometric standard for stellar coordinates.

In [10]:
#converts degree in range 0-360 to a radian in range -pi to pi
def deg_from_neg_pi_to_pi(deg):
    return (deg * np.pi / 180) - np.pi

#converts a degree in range -90 to 90 to a radian in range -pi/2 to pi/2
def deg_to_rad_90(deg):
    return (deg * np.pi / 180)

raHd = hdDf['_RAJ2000'].apply(deg_from_neg_pi_to_pi)
raExo = exoDf['ra'].apply(deg_from_neg_pi_to_pi)

decHd = hdDf['_DEJ2000'].apply(deg_to_rad_90)
decExo = exoDf['dec'].apply(deg_to_rad_90)

# Set a higher DPI for better resolution
fig = plt.figure(figsize=(15, 10), dpi=500)
ax = plt.subplot(111, projection='aitoff')

# Initialize the plot with initial data
sc_hd = ax.scatter(raHd, decHd, s=1, c='blue', alpha=0.1)
sc_exo = ax.scatter(raExo, decExo, s=1, c='red', alpha=0.3)
ax.set_xlabel('Right Ascension')
ax.set_ylabel('Declination')
ax.set_title('Systems with Confirmed Exoplanets')
ax.grid(True)
ax.set_facecolor('white')

# Update function for animation
def update(frame):
    global sc_hd, sc_exo
    # Shift in radians
    shift_radians = np.deg2rad(frame) % (2 * np.pi)
    
    # Update the RA for HD stars
    new_ra_hd = ((raHd + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    # Update the RA for exoplanets
    new_ra_exo = ((raExo + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    
    # Update the scatter plot data
    sc_hd.set_offsets(np.column_stack((new_ra_hd, decHd)))
    sc_exo.set_offsets(np.column_stack((new_ra_exo, decExo)))

    return sc_hd, sc_exo

#uncomment if you want to make the animation!
#ani = animation.FuncAnimation(fig, update, frames=np.arange(0, 360, 3), interval = 100, blit=True, repeat=True)
#ani.save('stars_animation.gif', writer='ffmpeg', dpi=200)

plt.show()

In the graph, we see confirmed exoplanetary systems in red, and stars in blue.

We can then animate this, to visualize these stars moving in the night sky.

[exoplanet animation]

As far as our analysis goes, this graph is very interesting!
For context, our Solar System sits inside the Milky Way Galaxy. If you look up at the stars on a dark night, you may be able to see a band of stars. This band appears because our Solar System is located toward the edge of the galaxy, and we are "looking in." It is represented in our visualization as the thick band of blue; we see increased observations in that region simply because there are many more stars in that direction, as most observable stars are in our own galaxy. We can also see a faint band of red running opposite to the thick blue band. This faint band of red is "looking out" of the galaxy. It makes sense that more exoplanets have been discovered "looking out," as there is much less interference in that direction. The large blue square patches are "HD Catalogue Extensions," which were added after Henry Draper's original catalogue was released to include additional stars. Perhaps most interestingly, the large red patch of confirmed exoplanets is the Kepler field: the field the Kepler Space Telescope observed during its four-year mission. We see an increased number of exoplanets confirmed there because of this lengthy observation.

Knowing this, let's make another visualization, of the relative distance from Earth to all currently discovered exoplanets.

In [7]:
#convert to cartesian coordinates
x = exoDf['sy_dist'] * np.cos(np.deg2rad(exoDf['dec'])) * np.cos(np.deg2rad(exoDf['ra']))
y = exoDf['sy_dist'] * np.cos(np.deg2rad(exoDf['dec'])) * np.sin(np.deg2rad(exoDf['ra']))
z = exoDf['sy_dist'] * np.sin(np.deg2rad(exoDf['dec']))


# Create a 3D scatter plot
fig = plt.figure(figsize=(12, 8), dpi=200)
ax = fig.add_subplot(111, projection='3d')

#plot
scatter = ax.scatter(x, y, z, c='red', marker='o', alpha=0.1)

# Set labels and title
ax.set_title('Cartesian plot of Confirmed Exoplanet Systems')

ax.set_xlim([-1500, 1500])
ax.set_ylim([-8000, 8000])
ax.set_zlim([-5000, 5000])

ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])


def update(frame):
    ax.view_init(elev=10, azim=frame)  # Rotate the view
    return scatter,

#uncomment if you want to make the animation!
#ani = animation.FuncAnimation(fig, update, frames=np.arange(0, 360, 2), interval=100)
#ani.save('circular_3d_plot.gif', writer='pillow', dpi=200, savefig_kwargs={"transparent": True})

plt.show()

In the graph, we can see confirmed exoplanets and their relative distances from Earth, which is located at the center.

We can then animate this to visualize the distribution in three dimensions.

[cartesian animation]

Again, this is a very interesting visual. As explained above, the Kepler field is now prominent in this graph as the conglomerate of discovered exoplanets protruding outward and downward from the center (Earth). The rest of the exoplanets roughly follow the plane of our Milky Way Galaxy.

Now, let's look at our data, and see if we can use it for regression. Let's first observe the spectral type, and relate it to the average number of exoplanets.

In [ ]:
#extract just the spectral type symbol
exoDf['st_spectype'] = exoDf['st_spectype'].str[:2]

#group by spectype and find the mean of planets of that spectype
avg_exoplanets = exoDf.groupby('st_spectype')['sy_pnum'].mean()

#sort highest to lowest
avg_exoplanets_sorted = avg_exoplanets.sort_values(ascending=False)

plt.figure(figsize=(15, 10))
avg_exoplanets_sorted.plot(kind='bar', color='green')
plt.xlabel('Spectral Type')
plt.ylabel('Average Number of Exoplanets')
plt.title('Average Number of Exoplanets by Spectral Type')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

We can see that there appears to be some correlation between spectral type and number of exoplanets; we can use this in our model if we wish.

Let's move on by now comparing stellar age, stellar mass, stellar density, stellar radius, stellar effective temperature, and stellar surface gravity, to the number of exoplanets in a system.

In [ ]:
#relate age with num of exoplanets

#mean num of exoplanets for each age
mean_exoplanets_by_age = exoDf.groupby('st_age')['sy_pnum'].mean()

# Filter the DataFrame to include only observations with stellar mass between 0 and 5 
#solar masses, this eliminates outliers in our data
filtered_exoDf = exoDf[(exoDf['st_mass'] >= 0) & (exoDf['st_mass'] <= 5)]
mean_exoplanets_by_mass = filtered_exoDf.groupby('st_mass')['sy_pnum'].mean()

filtered_exoDf = exoDf[(exoDf['st_dens'] >= 0) & (exoDf['st_dens'] <= 150)]
mean_exoplanets_by_density = filtered_exoDf.groupby('st_dens')['sy_pnum'].mean()

mean_exoplanets_by_radius = exoDf.groupby('st_rad')['sy_pnum'].mean()

filtered_exoDf = exoDf[(exoDf['st_teff'] >= 0) & (exoDf['st_teff'] <= 10000)]
mean_exoplanets_by_teff = filtered_exoDf.groupby('st_teff')['sy_pnum'].mean()

filtered_exoDf = exoDf[(exoDf['st_logg'] >= -2) & (exoDf['st_logg'] <= 6)]
mean_exoplanets_by_logg = filtered_exoDf.groupby('st_logg')['sy_pnum'].mean()

dataframes = [mean_exoplanets_by_age, mean_exoplanets_by_mass, mean_exoplanets_by_density,
              mean_exoplanets_by_radius, mean_exoplanets_by_teff, mean_exoplanets_by_logg]

variables = ['Stellar Age', 'Stellar Mass', 'Stellar Density', 'Stellar Radius', 'Stellar Teff', 'Stellar Logg']

# Create subplots
fig, axs = plt.subplots(3, 2, figsize=(15, 15), dpi=500)

# Flatten the axs array
axs = axs.flatten()

# Loop over each DataFrame and subplot
for i, (df, variable) in enumerate(zip(dataframes, variables)):
    # Perform linear regression
    slope, intercept, r_value, p_value, std_error = stats.linregress(df.index, df.values)
    
    # Plot the data and regression line
    axs[i].scatter(df.index.values, df.values, marker='o', color='black', alpha=0.5)
    axs[i].plot(df.index.values, intercept + slope * df.index.values, color='red', linestyle='-', linewidth=2)
    
    # Set plot title and labels
    axs[i].set_title(f'{variable} vs Number of Exoplanets, p={p_value}')
    axs[i].set_xlabel(variable)
    axs[i].set_ylabel('Number of Exoplanets')
    axs[i].grid(True)

# Adjust layout
plt.tight_layout()
plt.show()

These visualizations help us understand the relationship between each of these variables and the number of exoplanets in a system. We also calculated a regression line and a p-value for each relationship. In every case the p-value was well below the standard 0.05 threshold, so we can reject the null hypothesis and conclude that there is a statistically significant relationship between each variable and the dependent variable, the number of exoplanets.
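For reference, the p-value reported by scipy.stats.linregress tests the null hypothesis of zero slope. A minimal synthetic example (illustrative values, not the project's data):

```python
import numpy as np
from scipy import stats

# Synthetic data with a known slope of 2: linregress returns the fitted
# slope, intercept, and the two-sided p-value against a zero-slope null.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

result = stats.linregress(x, y)
print(result.pvalue < 0.05)  # True for this strongly linear data
```

When the p-value falls below the chosen threshold, we reject the null hypothesis of no linear relationship, exactly as in the six panels above.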

Let's move on by applying these 6 features to a regression model, so we can predict the number of exoplanets in a given system. We are going to try a few models and see which performs best. Specifically:
Linear Regression, ElasticNet, Random Forest, and Extra Trees.
This spread between linear regression models (Linear and ElasticNet) and tree-based regression models (Random Forest and Extra Trees) will give us a brief insight into what an optimal model looks like.

In [ ]:
#define features and target
X = exoDf[['st_age', 'st_mass', 'st_dens', 'st_rad', 'st_teff', 'st_logg']]
y = exoDf['sy_pnum']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#model = LinearRegression()
#model = DecisionTreeRegressor()

model = RandomForestRegressor(n_estimators=100)
model2 = ExtraTreesRegressor(n_estimators=200)
model3 = LinearRegression()
model4 = ElasticNet(alpha=0.1, l1_ratio=0.5)
model5 = DummyRegressor(strategy='mean')

#model = ExtraTreesClassifier(n_estimators=100)
model.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)
model4.fit(X_train, y_train)
model5.fit(X_train, y_train)



# Predict the number of exoplanets for the test set
y_pred = model.predict(X_test)
y_pred2 = model2.predict(X_test)
y_pred3 = model3.predict(X_test)
y_pred4 = model4.predict(X_test)
y_pred5 = model5.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
mse2 = mean_squared_error(y_test, y_pred2)
mse3 = mean_squared_error(y_test, y_pred3)
mse4 = mean_squared_error(y_test, y_pred4)
mse5 = mean_squared_error(y_test, y_pred5)

r2 = r2_score(y_test, y_pred)
r22 = r2_score(y_test, y_pred2)
r23 = r2_score(y_test, y_pred3)
r24 = r2_score(y_test, y_pred4)
r25 = r2_score(y_test, y_pred5)

print(f"Random Forest MSE: {mse}")
print(f"Random Forest R-squared: {r2}")
print(f"Extra Trees MSE: {mse2}")
print(f"Extra Trees R-squared: {r22}")
print(f"Linear Regression MSE: {mse3}")
print(f"Linear Regression R-squared: {r23}")
print(f"ElasticNet MSE: {mse4}")
print(f"ElasticNet R-squared: {r24}")
print(f"Dummy (mean) MSE: {mse5}")
print(f"Dummy (mean) R-squared: {r25}")

#combine into dataframe for plotting
plot_data = pd.DataFrame({
    'Actual': y_test,
    'Random Forest': y_pred,
    'Extra Trees': y_pred2,
    'Linear Regression': y_pred3,
    'ElasticNet': y_pred4,
    'Dummy (mean)': y_pred5
})

# Melt the DataFrame to long format for Seaborn
plot_data = plot_data.reset_index(drop=True)
plot_data_melted = plot_data.melt(var_name='Type', value_name='Number of Exoplanets')

# Plot the results
plt.figure(figsize=(15, 10), dpi=200)
sns.violinplot(x='Type', y='Number of Exoplanets', data=plot_data_melted, palette='Set2')
plt.xlabel('')
plt.ylabel('Number of Exoplanets')
plt.title('Actual vs Predicted Number of Exoplanets')
plt.grid(True)
plt.show()
Random Forest MSE: 0.3199367634819339
Random Forest R-squared: 0.7798217920302237
Extra Trees MSE: 0.2573534284124437
Extra Trees R-squared: 0.822891198666734
Linear Regression MSE: 1.4288107134911496
Linear Regression R-squared: 0.016702616477324184
ElasticNet MSE: 1.4352209391935493
ElasticNet R-squared: 0.012291144683724275
Dummy (mean) MSE: 2.240789656610594
Dummy (mean) R-squared: -0.5420955243162937

We can see that the performance of the tree-based regressors is much better than that of the linear regressors. This makes sense, as our data is complex and non-linear, and trees allow for a better fit on this type of data. We can also see that our models perform far better than chance: a dummy regressor that guesses the mean for every observation has an MSE of ~2.2 and an R-squared of -0.5, which is a very poor fit. Let's remove the dummy regressor and both linear regressors, since the tree regressors are much better, and plot again.
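One caveat worth noting: a single 80/20 split can over- or under-state these gaps. A k-fold cross-validation sketch on synthetic non-linear data (the data and settings here are illustrative, not the project's) shows the same tree-vs-linear comparison more robustly:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic non-linear target: a linear model cannot capture the sine term,
# while a tree ensemble can. Cross-validated R^2 averages over 5 splits.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = np.sin(6 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

for name, est in [('Extra Trees', ExtraTreesRegressor(n_estimators=100, random_state=0)),
                  ('Linear', LinearRegression())]:
    scores = cross_val_score(est, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R^2 = {scores.mean():.2f}")
```

The same `cross_val_score` call could be applied to the models above to confirm the split we happened to draw is representative.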

In [ ]:
#remove linear/elasticnet
plot_data = pd.DataFrame({
    'Actual': y_test,
    'Random Forest': y_pred,
    'Extra Trees': y_pred2,
})

# Melt the DataFrame to long format for Seaborn
plot_data = plot_data.reset_index(drop=True)
plot_data_melted = plot_data.melt(var_name='Type', value_name='Number of Exoplanets')

#print MSE and r^2 values
print(f"Random Forest MSE: {mse}")
print(f"Random Forest R-squared: {r2}")
print(f"Extra Trees MSE: {mse2}")
print(f"Extra Trees R-squared: {r22}")

# Plot the results
plt.figure(figsize=(15, 10), dpi=200)
sns.violinplot(x='Type', y='Number of Exoplanets', data=plot_data_melted, palette='Set2')
plt.xlabel('')
plt.ylabel('Number of Exoplanets')
plt.title('Actual vs Predicted Number of Exoplanets')
plt.grid(True)
plt.show()
Random Forest MSE: 0.3199367634819339
Random Forest R-squared: 0.7798217920302237
Extra Trees MSE: 0.2573534284124437
Extra Trees R-squared: 0.822891198666734

We can see that our tree-based regressors, Random Forest and Extra Trees, are quite good fits for the data. The better of the two, Extra Trees, has an MSE of ~0.26 and an R-squared of ~0.82. This indicates the model fits our data well and is quite good at predicting the number of exoplanets in a planetary system. It makes sense that the Extra Trees regressor would have an advantage over the Random Forest on this dataset: Extra Trees adds an extra layer of randomness (randomized split thresholds), which helps smooth over the outliers and noise this data contains a good amount of.
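As a side note, tree ensembles also expose feature_importances_, which could show which stellar parameters drive these predictions. Here is a self-contained sketch on synthetic data (illustrative only; running this on the fitted model above would use the six stellar columns instead):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in: feature_importances_ ranks how much each input
# contributed to the splits. Here feat_0 dominates the target by design.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=400)

est = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(['feat_0', 'feat_1', 'feat_2'], est.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

The importances sum to 1, so they can be read as each feature's relative share of the model's split decisions.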

Let's try to make another model using only the parameters that the HD catalogue shares with the exoplanetary data: apparent magnitude and spectral type. Let's begin by ordering the spectral types by surface temperature, O > B > A > F > G > K > M > C, and see if there is a relationship with the number of planets, and whether it is significant.
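For the leading class letter alone, an ordered pandas Categorical offers a lighter-weight encoding than a custom sort key; this toy sketch (made-up spectral types) ignores the numeric subclass (e.g. the '2' in 'G2'), which a full implementation would also need to handle:

```python
import pandas as pd

# Sketch: encode the leading spectral-class letter as an ordered integer,
# hottest (O) to coolest (C), via an ordered Categorical.
order = ['O', 'B', 'A', 'F', 'G', 'K', 'M', 'C']
spectypes = pd.Series(['G2', 'K0', 'M3', 'F8'])
codes = pd.Categorical(spectypes.str[0], categories=order, ordered=True).codes
print(list(codes))  # [4, 5, 6, 3]
```

Unrecognized letters get code -1, giving a built-in way to flag unexpected spectral classes.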

In [111]:
# Drop NA spectral types
num_specs = exoDf.dropna(subset=['st_spectype'])

# Extract just the spectral type symbol
exoDf['st_spectype'] = exoDf['st_spectype'].str[:2]

# Group by spectype and find the mean of planets of that spectype
avg_exoplanets = exoDf.groupby('st_spectype')['sy_pnum'].mean()

order = ['O', 'B', 'A', 'F', 'G', 'K', 'M', 'C']
# Create a dictionary for quick lookup of the order
order_dict = {char: index for index, char in enumerate(order)}
def custom_sort_key(string):
    string = str(string)
    if not string or string[0] not in order_dict:
        return (len(order), '' , string)
    first_char = string[0]
    rest = string[1:] if len(string) > 1 else ''
    
    first_char_order = order_dict[first_char]
    
    # Determine the sort key for the second character or number
    if rest == '':
        second_char_order = (-1, '')  # Single letters come first
    elif rest.isdigit():
        second_char_order = (0, int(rest))  # Numbers come before letters
    elif rest in order_dict:
        second_char_order = (1, order_dict[rest])  # Maintain the custom order for letters
    else:
        second_char_order = (2, rest)  # Anything else

    return (first_char_order, second_char_order)

# Sort highest to lowest
avg_exoplanets_sorted = avg_exoplanets.loc[sorted(avg_exoplanets.index, key=custom_sort_key)]

#Extract unique spectral types and sort them using the custom sort key
unique_spectypes = num_specs['st_spectype'].unique()
sorted_spectypes = sorted(unique_spectypes, key=custom_sort_key)

# Create a mapping from sorted spectral types to numbers
spectype_to_num = {spectype: idx for idx, spectype in enumerate(sorted_spectypes)}

# Reassign the spectral types in the original DataFrame
num_specs.loc[:, 'st_spectype'] = num_specs['st_spectype'].map(spectype_to_num)

avg_exoplanets = num_specs.groupby('st_spectype')['sy_pnum'].mean()

# Create a mapping from sorted spectral types to numbers
spectype_to_num = {spectype: idx for idx, spectype in enumerate(sorted_spectypes)}

# Perform linear regression for the first plot
slope, intercept, r_value, p_value, std_err = stats.linregress(avg_exoplanets.index, avg_exoplanets)

# Plot the first subplot (scatter plot with regression line)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Subplot 1: Scatter plot with regression line
ax1.scatter(avg_exoplanets.index, avg_exoplanets, color='green', label='Data')
ax1.plot(avg_exoplanets.index.values, intercept + slope * avg_exoplanets.index.values, color='blue', label='Regression Line')
ax1.set_xlabel('Spectral Type')
ax1.set_ylabel('Average Number of Exoplanets')
ax1.set_title('Average number of exoplanets by spectral type')
ax1.grid(axis='y')
ax1.legend()

# Perform linear regression for the second plot
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(num_specs['sy_vmag'], num_specs['sy_pnum'])

print("p_value for spectral type/number of planets: " + str(p_value))
print("p_value for apparent magnitude/number of planets " + str(p_value2))

# Subplot 2: Scatter plot of apparent magnitude vs. number of planets with regression line
ax2.scatter(num_specs['sy_vmag'], num_specs['sy_pnum'], color='green',)
ax2.plot(num_specs['sy_vmag'].values, intercept2 + slope2 * num_specs['sy_vmag'].values, color='blue', label='Regression Line')
ax2.set_xlabel('Apparent Magnitude')
ax2.set_ylabel('Number of Planets')
ax2.set_title('Number of exoplanets by apparent (V) magnitude')
ax2.grid(axis='both')
ax2.legend()

plt.tight_layout()
plt.show()
p_value for spectral type/number of planets: 0.0070388507998116676
p_value for apparent magnitude/number of planets 0.08935368277530657

We can see there is statistical significance between the spectral type and the number of exoplanets, with a very low p-value of around 0.007. There is no statistical significance between the apparent magnitude (V magnitude) and the number of planets, but we are going to leave it in our regression, as it may help the model when used in combination with spectral type. Spectral type and apparent magnitude are directly related, so we shouldn't leave magnitude out of our model just because it lacks significance by itself.

In [60]:
#define features and target
num_specs.loc[:, 'sy_vmag'] = num_specs['sy_vmag'].fillna(num_specs['sy_vmag'].mean())
X = num_specs[['st_spectype', 'sy_vmag']]
y = num_specs['sy_pnum']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#model = LinearRegression()
#model = DecisionTreeRegressor()

model = RandomForestRegressor(n_estimators=100)
model2 = ExtraTreesRegressor(n_estimators=200)
model3 = LinearRegression()
model4 = ElasticNet(alpha=0.1, l1_ratio=0.5)
model5 = DummyRegressor(strategy='mean')

#model = ExtraTreesClassifier(n_estimators=100)
model.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)
model4.fit(X_train, y_train)
model5.fit(X_train, y_train)



# Predict the number of exoplanets for the test set
y_pred = model.predict(X_test)
y_pred2 = model2.predict(X_test)
y_pred3 = model3.predict(X_test)
y_pred4 = model4.predict(X_test)
y_pred5 = model5.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
mse2 = mean_squared_error(y_test, y_pred2)
mse3 = mean_squared_error(y_test, y_pred3)
mse4 = mean_squared_error(y_test, y_pred4)
mse5 = mean_squared_error(y_test, y_pred5)

r2 = r2_score(y_test, y_pred)
r22 = r2_score(y_test, y_pred2)
r23 = r2_score(y_test, y_pred3)
r24 = r2_score(y_test, y_pred4)
r25 = r2_score(y_test, y_pred5)

print(f"Random Forest MSE: {mse}")
print(f"Random Forest R-squared: {r2}")
print(f"Extra Trees MSE: {mse2}")
print(f"Extra Trees R-squared: {r22}")
print(f"Linear Regression MSE: {mse3}")
print(f"Linear Regression R-squared: {r23}")
print(f"ElasticNet MSE: {mse4}")
print(f"ElasticNet R-squared: {r24}")
print(f"Dummy (mean) MSE: {mse5}")
print(f"Dummy (mean) R-squared: {r25}")

#combine into dataframe for plotting
plot_data = pd.DataFrame({
    'Actual': y_test,
    'Random Forest': y_pred,
    'Extra Trees': y_pred2,
    'Linear Regression': y_pred3,
    'ElasticNet': y_pred4,
    'Dummy (mean)': y_pred5
})

# Melt the DataFrame to long format for Seaborn
plot_data = plot_data.reset_index(drop=True)
plot_data_melted = plot_data.melt(var_name='Type', value_name='Number of Exoplanets')

# Plot the results
plt.figure(figsize=(15, 10), dpi=200)
sns.violinplot(x='Type', y='Number of Exoplanets', data=plot_data_melted, palette='Set2')
plt.xlabel('')
plt.ylabel('Number of Exoplanets')
plt.title('Actual vs Predicted Number of Exoplanets')
plt.grid(True)
plt.show()
Random Forest MSE: 0.19722707259069316
Random Forest R-squared: 0.8536779423325951
Extra Trees MSE: 0.18147731170592116
Extra Trees R-squared: 0.8653626334358904
Linear Regression MSE: 1.2239865773686935
Linear Regression R-squared: 0.09192874890177316
ElasticNet MSE: 1.2236877740833287
ElasticNet R-squared: 0.09215042998732614
Dummy (mean) MSE: 1.3574024841193293
Dummy (mean) R-squared: -0.007052033730584606

This is great! This model fits the data even better than our previous one. The best performer is again the Extra Trees model, with an MSE of ~0.18 and an R-squared of ~0.87, again a very good fit. Let's take a closer look at this new Extra Trees model's graph.

In [62]:
#combine into dataframe for plotting
plot_data = pd.DataFrame({
    'Actual': y_test,
    'Extra Trees': y_pred2,
})

# Melt the DataFrame to long format for Seaborn
plot_data = plot_data.reset_index(drop=True)
plot_data_melted = plot_data.melt(var_name='Type', value_name='Number of Exoplanets')

print(f"Extra Trees MSE: {mse2}")
print(f"Extra Trees R-squared: {r22}")

# Plot the results
plt.figure(figsize=(15, 10), dpi=200)
sns.violinplot(x='Type', y='Number of Exoplanets', data=plot_data_melted, palette='Set2')
plt.xlabel('')
plt.ylabel('Number of Exoplanets')
plt.title('Actual vs Predicted Number of Exoplanets')
plt.grid(True)
plt.show()
Extra Trees MSE: 0.18147731170592116
Extra Trees R-squared: 0.8653626334358904

Great! This is an even better fit of our data, with an MSE of only 0.181 and an R-squared of 0.865. Now, let's convert the spectral types of the HD catalogue stars to usable integers so we can make predictions for those stars.

In [101]:
#drop na spectral types (copy so later assignments don't modify a view of hdDf)
hd_specs = hdDf.dropna(subset=['SpT']).copy()

#extract just the spectral type symbol
hd_specs['SpT'] = hd_specs['SpT'].str[:2]


order = ['O', 'B', 'A', 'F', 'G', 'K', 'M', 'C']
# Create a dictionary for quick lookup of the order
order_dict = {char: index for index, char in enumerate(order)}
def custom_sort_key(string):
    string = str(string)
    if not string or string[0] not in order_dict:
        return (len(order), '' , string)
    first_char = string[0]
    rest = string[1:] if len(string) > 1 else ''
    
    first_char_order = order_dict[first_char]
    
    # Determine the sort key for the second character or number
    if rest == '':
        second_char_order = (-1, '')  # Single letters come first
    elif rest.isdigit():
        second_char_order = (0, int(rest))  # Numbers come before letters
    elif rest in order_dict:
        second_char_order = (1, order_dict[rest])  # Maintain the custom order for letters
    else:
        second_char_order = (2, rest)  # Anything else

    return (first_char_order, second_char_order)

# Extract unique spectral types and sort them using the custom sort key
unique_spectypes = hd_specs['SpT'].unique()
sorted_spectypes = sorted(unique_spectypes, key=custom_sort_key)

# Create a mapping from sorted spectral types to numbers
spectype_to_num = {spectype: idx for idx, spectype in enumerate(sorted_spectypes)}

# Reassign the spectral types in the original DataFrame
hd_specs.loc[:, 'SpT'] = hd_specs['SpT'].map(spectype_to_num)

Now let's make predictions on the HD catalogue stars.

In [125]:
#convert hd_specs Ptm (V mag) to int
hd_specs['Ptm'] = pd.to_numeric(hd_specs['Ptm'], errors='coerce')

# Impute missing values with the mean of the column
mean = hd_specs['Ptm'].mean()
hd_specs['Ptm'] = hd_specs['Ptm'].fillna(mean)


X_hd = hd_specs[['SpT', 'Ptm']]

predictions_hd = model2.predict(X_hd)

Great! Now we have predictions for our stars. Using a modification of our earlier graph, let's make a color map showing how many exoplanets are predicted in each system.

In [67]:
#converts degree in range 0-360 to a radian in range -pi to pi
def deg_from_neg_pi_to_pi(deg):
    return (deg * np.pi / 180) - np.pi

#converts a degree in range -90 to 90 to a radian in range -pi/2 to pi/2
def deg_to_rad_90(deg):
    return (deg * np.pi / 180)

raHd = hdDf['_RAJ2000'].apply(deg_from_neg_pi_to_pi)
raExo = exoDf['ra'].apply(deg_from_neg_pi_to_pi)

decHd = hdDf['_DEJ2000'].apply(deg_to_rad_90)
decExo = exoDf['dec'].apply(deg_to_rad_90)

# Set a higher DPI for better resolution
fig = plt.figure(figsize=(15, 10), dpi=300)
ax = plt.subplot(111, projection='aitoff')

#color map for num of exoplanets
exo_colors = exoDf['sy_pnum']

# Initialize the plot with initial data
#sc_hd = ax.scatter(raHd, decHd, s=1, c='blue', alpha=0.1)
sc_exo = ax.scatter(raExo, decExo, s=1, c=exo_colors, cmap='plasma', alpha=0.3)
ax.set_xlabel('Right Ascension')
ax.set_ylabel('Declination')
ax.set_title('Systems with Confirmed Exoplanets')
ax.grid(True)
ax.set_facecolor('white')

# Update function for animation
def update(frame):
    global sc_hd, sc_exo
    # Shift in radians
    shift_radians = np.deg2rad(frame) % (2 * np.pi)
    
    # Update the RA for HD stars
    new_ra_hd = ((raHd + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    # Update the RA for exoplanets
    new_ra_exo = ((raExo + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    
    # Update the scatter plot data
    sc_hd.set_offsets(np.column_stack((new_ra_hd, decHd)))
    sc_exo.set_offsets(np.column_stack((new_ra_exo, decExo)))

    return sc_hd, sc_exo

#uncomment if you want to make the animation!
#ani = animation.FuncAnimation(fig, update, frames=np.arange(0, 360, 3), interval = 100, blit=True, repeat=True)
#ani.save('stars_animation.gif', writer='ffmpeg', dpi=200)

plt.show()

Above is what we have with no predictions; let's fill that in with the predictions we've made from our HD catalogue data.

In [136]:
#converts degree in range 0-360 to a radian in range -pi to pi
def deg_from_neg_pi_to_pi(deg):
    return (deg * np.pi / 180) - np.pi

#converts a degree in range -90 to 90 to a radian in range -pi/2 to pi/2
def deg_to_rad_90(deg):
    return (deg * np.pi / 180)

raHd = hdDf['_RAJ2000'].apply(deg_from_neg_pi_to_pi)
raExo = exoDf['ra'].apply(deg_from_neg_pi_to_pi)

decHd = hdDf['_DEJ2000'].apply(deg_to_rad_90)
decExo = exoDf['dec'].apply(deg_to_rad_90)

# Set a higher DPI for better resolution
fig = plt.figure(figsize=(15, 10), dpi=300)
ax = plt.subplot(111, projection='aitoff')

#color map for num of exoplanets
exo_colors = exoDf['sy_pnum']
hd_colors = predictions_hd

# Initialize the plot with initial data
sc_hd = ax.scatter(raHd, decHd, s=1, c=hd_colors, cmap='rainbow', alpha=0.1)
sc_exo = ax.scatter(raExo, decExo, s=1, c=exo_colors, cmap='rainbow', alpha=0.3)
ax.set_xlabel('Right Ascension')
ax.set_ylabel('Declination')
ax.set_title('Systems with Confirmed Exoplanets')
ax.grid(True)
ax.set_facecolor('white')

# Update function for animation
def update(frame):
    global sc_hd, sc_exo
    # Shift in radians
    shift_radians = np.deg2rad(frame) % (2 * np.pi)
    
    # Update the RA for HD stars
    new_ra_hd = ((raHd + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    # Update the RA for exoplanets
    new_ra_exo = ((raExo + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    
    # Update the scatter plot data
    sc_hd.set_offsets(np.column_stack((new_ra_hd, decHd)))
    sc_exo.set_offsets(np.column_stack((new_ra_exo, decExo)))

    return sc_hd, sc_exo

#uncomment if you want to make the animation!
#ani = animation.FuncAnimation(fig, update, frames=np.arange(0, 360, 3), interval = 100, blit=True, repeat=True)
#ani.save('stars_animation.gif', writer='ffmpeg', dpi=200)

plt.show()

And with that, we can clearly see the predicted number of exoplanets from our model. In this graph, stellar systems with a low predicted number of exoplanets appear in dark purple, and systems with a higher predicted number appear in lighter colors.

The final model was trained using an Extra Trees regressor, achieving an MSE of ~0.181 and an R-squared of 0.865, to predict the existence of exoplanets in stellar systems where they have not yet been discovered.